Dataframes

Quantitative Methodology (UPF)

Jordi Mas Elias

https://www.jordimas.cat/

Summary

  • Warm up
  • What is a dataframe?
  • Observations
  • Variables
  • Recoding variables
  • Scope of data

Warm up

R learning curve

What is a dataframe?

Table

It's a generic name. It can be almost anything.

  • Periodic table
  • Multiplication table
  • Truth table
  • Chi squared table
  • Phonetic table

Data(s)

  • Source of information (SI): Raw empirical material.
  • Data (s/p): Collected, processed, systematized and organized SI (Van Evera 2009).
    • Numbers, characters, symbols … with no meaning on their own.
  • Database: An organized collection of data stored and accessed electronically / An organized collection of data stored as multiple datasets.
  • Dataset: A structured collection of data generally associated with a unique body of work.

Spreadsheet

How Excel stores data in two dimensions:

Dataframe

A way to store data in R in two dimensions, rows and columns:

# A tibble: 17,548 × 9
   scode country      year polity2 xrreg xrcomp xropen xconst parreg
   <chr> <chr>       <dbl>   <dbl> <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
 1 AFG   Afghanistan  1800      -6     3      1      1      1      3
 2 AFG   Afghanistan  1801      -6     3      1      1      1      3
 3 AFG   Afghanistan  1802      -6     3      1      1      1      3
 4 AFG   Afghanistan  1803      -6     3      1      1      1      3
 5 AFG   Afghanistan  1804      -6     3      1      1      1      3
 6 AFG   Afghanistan  1805      -6     3      1      1      1      3
 7 AFG   Afghanistan  1806      -6     3      1      1      1      3
 8 AFG   Afghanistan  1807      -6     3      1      1      1      3
 9 AFG   Afghanistan  1808      -6     3      1      1      1      3
10 AFG   Afghanistan  1809      -6     3      1      1      1      3
# … with 17,538 more rows
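As a minimal sketch, a dataframe like the one above can be built by hand with base R's data.frame(). The values copy the first three rows of the table; the object name polity_small is hypothetical and only used for illustration.

```r
# Build a small dataframe by hand: one column per variable, one row per observation.
# `polity_small` is a hypothetical name; the values copy the first rows shown above.
polity_small <- data.frame(
  scode   = c("AFG", "AFG", "AFG"),
  country = c("Afghanistan", "Afghanistan", "Afghanistan"),
  year    = c(1800, 1801, 1802),
  polity2 = c(-6, -6, -6)
)

dim(polity_small)    # number of rows, then number of columns
names(polity_small)  # the column (variable) names
polity_small[2, ]    # the second row (one observation)
polity_small$year    # one column (one variable) as a vector
```

Rows are selected with `[row, ]` and columns with `$name`, which mirrors the two dimensions of the table.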

Tidy data

A dataframe is tidy if it meets the following requirements (Wickham 2014):

  • Each dataframe has one unit of observation.
  • Observations are represented in the rows.
  • Variables are represented in the columns.
  • Each cell indicates a value.

RStudio workflow

Load packages.

library(dplyr)
library(readr)
library(stringr)
library(forcats)

Observations

Observing …

We need to decide what the units of interest are.

What is an observation?

  • Unit of analysis: The thing that we want to know about.
    • Determined by the hypothesis / question.
  • Unit of observation: Each row of a dataframe.
    • Determined by the data.

Ethnic Power Relations, International Conflict Research.

# A tibble: 14 × 5
   countryname  year groupname statusname     groupsize
   <chr>       <dbl> <chr>     <chr>              <dbl>
 1 Belgium      1967 Flemings  JUNIOR PARTNER     0.59 
 2 Belgium      1967 Walloon   SENIOR PARTNER     0.4  
 3 Belgium      1967 Germans   IRRELEVANT         0.01 
 4 France       1967 French    MONOPOLY           0.976
 5 France       1967 Basques   POWERLESS          0.013
 6 France       1967 Corsicans POWERLESS          0.004
 7 France       1967 Roma      DISCRIMINATED      0.006
 8 Belgium      1968 Flemings  JUNIOR PARTNER     0.59 
 9 Belgium      1968 Walloon   SENIOR PARTNER     0.4  
10 Belgium      1968 Germans   IRRELEVANT         0.01 
11 France       1968 French    MONOPOLY           0.976
12 France       1968 Basques   POWERLESS          0.013
13 France       1968 Corsicans POWERLESS          0.004
14 France       1968 Roma      DISCRIMINATED      0.006
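A short sketch of how the unit of observation depends on the data, using the 1967 rows of the table above. The object names epr and by_country are hypothetical.

```r
# The 1967 rows of the table above (`epr` is a hypothetical object name).
epr <- data.frame(
  countryname = c(rep("Belgium", 3), rep("France", 4)),
  year        = 1967,
  groupname   = c("Flemings", "Walloon", "Germans",
                  "French", "Basques", "Corsicans", "Roma"),
  groupsize   = c(0.59, 0.4, 0.01, 0.976, 0.013, 0.004, 0.006)
)

# Unit of observation here: the ethnic group in a country-year (one row each).
nrow(epr)

# Collapsing the rows changes the unit of observation to the country-year.
by_country <- aggregate(groupsize ~ countryname + year, data = epr, FUN = length)
names(by_country)[3] <- "n_groups"  # the third column now counts groups
by_country
```

If the question is about countries, the collapsed table matches the unit of analysis; if it is about ethnic groups, the original rows do.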

Levels of analysis

  • Macro level: States, regions, legal systems.
  • Meso level: Organizations, ethnic groups, political parties.
  • Micro level: Families, individuals, relationships.
    • Events: Bombings, contracts, terrorist attacks.

# A tibble: 477 × 8
   cowcode region  year country    no  coup successful combat
     <dbl>  <dbl> <dbl> <chr>   <dbl> <dbl>      <dbl>  <dbl>
 1      40      5  1952 Cuba        1     1          1      1
 2      40      5  1957 Cuba        1     1          0      1
 3      41      5  1950 Haiti       1     1          1      0
 4      41      5  1956 Haiti       1     1          0      0
 5      41      5  1957 Haiti       1     1          1      0
 6      41      5  1957 Haiti       2     1          1      0
 7      41      5  1957 Haiti       3     1          1      0
 8      41      5  1958 Haiti       1     1          0      1
 9      41      5  1970 Haiti       1     1          0      0
10      41      5  1986 Haiti       1     1          1      0
# … with 467 more rows

Coup Agency and Mechanisms Dataset

Ecological fallacy

When the unit of analysis (UA) and the unit of observation (UO) are not the same, we risk committing an ecological fallacy.

Barcelona local elections: District level.

Barcelona local elections: Neighbourhood level.

Barcelona local elections: Census section level.
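The ecological fallacy can be sketched with invented toy data in which the relationship between two variables has one sign across areas and the opposite sign inside every area. Nothing here comes from the Barcelona election data; the names toy, x and y are hypothetical.

```r
# Invented toy data: 9 individuals nested in 3 areas.
toy <- data.frame(
  area = rep(c("A", "B", "C"), each = 3),
  x    = 1:9,
  y    = c(3, 2, 1, 6, 5, 4, 9, 8, 7)
)

# Aggregate level: correlation between the area means of x and y.
area_means  <- aggregate(cbind(x, y) ~ area, data = toy, FUN = mean)
cor_between <- cor(area_means$x, area_means$y)

# Individual level: correlation between x and y inside each area.
cor_within <- sapply(split(toy, toy$area), function(d) cor(d$x, d$y))

cor_between  # positive across areas
cor_within   # negative inside every area
```

Inferring the individual-level relationship from the area-level one would give the wrong sign in this example.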

Variables

What is a variable?

A characteristic of the object we’re studying.

  • It varies across units.

# A tibble: 6 × 5
  region municipality            religion   population suicide
  <chr>  <chr>                   <chr>           <dbl>   <dbl>
1 Isère  Grenoble                Protestant       8250     520
2 Isère  Grenoble                Catholic         1080      72
3 Isère  Le Bourg-d'Oisans       Protestant        325      12
4 Isère  Le Bourg-d'Oisans       Catholic          593      20
5 Isère  Saint-Jean-de-Maurienne Protestant        181       5
6 Isère  Saint-Jean-de-Maurienne Catholic          392      11
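A quick way to check whether a column actually varies is to count its distinct values. The sketch below copies the table above into a base R dataframe; the object names suicide_df and n_values are hypothetical.

```r
# The table above as a base R dataframe (`suicide_df` is a hypothetical name).
suicide_df <- data.frame(
  region       = rep("Isère", 6),
  municipality = rep(c("Grenoble", "Le Bourg-d'Oisans",
                       "Saint-Jean-de-Maurienne"), each = 2),
  religion     = rep(c("Protestant", "Catholic"), times = 3),
  population   = c(8250, 1080, 325, 593, 181, 392),
  suicide      = c(520, 72, 12, 20, 5, 11)
)

# Count the distinct values in each column.
n_values <- sapply(suicide_df, function(col) length(unique(col)))
n_values
```

In these six rows region takes a single value, so within this sample it behaves as a constant rather than a variable.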

Types of variables (1a): Nominal

Unordered categories:

  • Municipality: Barcelona, Sant Cugat, Granollers…
  • Religion: Muslim, Catholic, Shinto…
  • Language: Russian, Catalan, Swedish.
  • Ideology: Conservatism, Nationalism, Liberalism…
  • Political parties: PSOE, PP, Cs, ERC…

For strings, see stringr (Wickham 2022) and its cheatsheet.

Types of variables (1b): Nominal

  • Storage: Character, factor

  • Operations:

    • Equality: ==
    • Membership: %in%
    • Inequality: !=
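A minimal sketch of these operations on a nominal variable stored as a character vector. The vector party is hypothetical and reuses the party labels mentioned earlier.

```r
# A nominal variable stored as a character vector (hypothetical values).
party <- c("PSOE", "PP", "Cs", "ERC", "PP")

party == "PP"                # equality: FALSE TRUE FALSE FALSE TRUE
party %in% c("PSOE", "ERC")  # membership: TRUE FALSE FALSE TRUE FALSE
party != "Cs"                # inequality: TRUE TRUE FALSE TRUE TRUE

# The same variable stored as an (unordered) factor.
party_f <- factor(party)
nlevels(party_f)             # number of distinct categories
```

Each comparison returns a logical vector with one element per observation, which is what filtering and recoding build on.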

Types of variables (2a): Ordinal

Ordered categories:

  • Things: Small, Medium, Large.
  • Age: Child, Young, Adult.
  • Education: Primary, Secondary, Tertiary.
  • Ideas: Disagree, Neutral, Agree.

For factors, see forcats (Wickham 2021) and its cheatsheet.

Types of variables (2b): Ordinal

  • Storage: Ordered factor

  • Operations:

    • Equality: ==
    • Membership: %in%
    • Inequality: !=
    • Greater than: >
    • Greater than or equal to: >=
    • Less than: <
    • Less than or equal to: <=
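A minimal sketch of an ordinal variable stored as an ordered factor in base R. The vector education is hypothetical; the order of the categories is the one given in the levels argument.

```r
# An ordinal variable: the order of the categories is set by `levels`.
education <- factor(
  c("Secondary", "Primary", "Tertiary", "Primary"),
  levels  = c("Primary", "Secondary", "Tertiary"),
  ordered = TRUE
)

education == "Primary"    # FALSE TRUE FALSE TRUE
education > "Primary"     # TRUE FALSE TRUE FALSE
education <= "Secondary"  # TRUE TRUE FALSE TRUE
```

Without ordered = TRUE the factor would be nominal, and order comparisons such as > would not be meaningful.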

Types of variables (3a): Interval

Numbers; the zero point is arbitrary.

  • Year: 2004, 2005, 2008, 2010.
  • Temperature (except Kelvin): 10, 25, 30.
  • Ideology: Left-right measured as 0-10.
  • Coordinates: Longitude and latitude.

Types of variables (3b): Interval

  • Storage: Numeric, integer, date

  • Operations:

    • Equality: ==
    • Membership: %in%
    • Inequality: !=
    • Greater than: >
    • Greater than or equal to: >=
    • Less than: <
    • Less than or equal to: <=
    • Sums and differences: +, -
    • Max and min: max(), min()
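A short sketch of why differences are meaningful for interval variables while ratios are not. The years are the ones listed earlier; the dates and temperatures are invented for illustration.

```r
# Years: differences are meaningful.
year <- c(2004, 2005, 2008, 2010)
year - min(year)       # years elapsed since the first one: 0 1 4 6
max(year) - min(year)  # range: 6

# Dates are also interval variables (hypothetical dates).
d <- as.Date(c("2020-01-01", "2020-01-31"))
days_between <- as.numeric(diff(d))  # 30

# Ratios depend on where zero is placed, so they are not meaningful:
celsius <- c(10, 20)
kelvin  <- celsius + 273.15
ratio_celsius <- celsius[2] / celsius[1]  # 2
ratio_kelvin  <- kelvin[2] / kelvin[1]    # much closer to 1 than to 2
```

The same two temperatures give different ratios on the two scales, while their difference (10 degrees) is the same in both.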

Types of variables (4a): Ratio

Numbers; zero has meaning.

  • Age: 2, 5, 7, 9.
  • Percentages: 0%, 34%, 100%.
  • Population: 200, 3345000, 13000000.
  • Indices (not all of them): 0.245, 0.999.

Types of variables (4b): Ratio

  • Storage: Numeric

  • Operations:

    • Equality: ==
    • Membership: %in%
    • Inequality: !=
    • Greater than: >
    • Greater than or equal to: >=
    • Less than: <
    • Less than or equal to: <=
    • Sums: +
    • Differences: -
    • Multiplication: *
    • Division: /
    • Other: sqrt(), log(), exp(), max(), min(), mean()
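A minimal sketch of ratio-level operations, using the population values listed earlier. The object names are hypothetical.

```r
# A ratio variable: zero means "no population", so ratios are meaningful.
population <- c(200, 3345000, 13000000)

population / 1000                         # rescale to thousands
times_larger <- population[2] / population[1]  # 16725 times larger
mean(population)                          # arithmetic mean
log(population)                           # natural logarithm, useful for skewed data
sqrt(population)                          # square root
```

Statements such as "the second unit is 16725 times larger than the first" only make sense because zero is a true zero.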

Types of variables (5): Summary

Type                 Characteristics             Vector               Operations
Categorical nominal  Unordered categories        Character or factor  ==, %in%, !=
Categorical ordinal  Ordered categories          Factor               ==, %in%, !=, <=, <, >, >=
Numeric interval     Numbers, zero is arbitrary  Numeric or integer   ==, !=, <=, <, >, >=, +, -
Numeric ratio        Numbers, zero has meaning   Numeric              ==, !=, <=, <, >, >=, +, -, *, / …

Bibliography

Van Evera, Stephen. 2009. Guía para Estudiantes de Ciencia Política: Métodos y Recursos. Barcelona: Gedisa.
Wickham, Hadley. 2014. “Tidy Data.” Journal of Statistical Software 59 (10): 1–23.
———. 2021. Forcats: Tools for Working with Categorical Variables (Factors). https://CRAN.R-project.org/package=forcats.
———. 2022. Stringr: Simple, Consistent Wrappers for Common String Operations. https://CRAN.R-project.org/package=stringr.